ABSTRACT

In this paper we present OMP2MPI, a tool that automatically
generates MPI source code from OpenMP. With this
transformation, the original program can exploit a larger
number of processors by surpassing the limits of the node
level on large HPC clusters. The transformation is also useful
for adapting source code to distributed memory many-cores
with message passing support. In addition, the resulting MPI
code can be used as a starting point that software engineers
can optimize further. The transformation process focuses on
detecting OpenMP parallel loops and distributing them in a
master/worker pattern. A set of micro-benchmarks has been
used to verify the correctness of the transformation and to
measure the resulting performance. Surprisingly, the
automatically generated code is not only correct by
construction, but it also often runs faster than the original
OpenMP version, even though it is executed with MPI.

Categories and Subject Descriptors 

D.3.2 [Language Classifications]: Concurrent, distributed,
and parallel languages; D.3.4 [Processors]: Translator writing
systems and compiler generators

General Terms 

Parallel Computing 

Keywords 

Source to Source Compiler, Shared Memory, MPI, Parallel
Computing, Program Understanding, Compiler Optimization

1. INTRODUCTION 

One of the strengths of the OpenMP paradigm is the simplicity
of its programming model. The invocation of communication
primitives is hidden from the programmer, since they are
introduced implicitly by compilation directives working in
conjunction with the OpenMP runtime. However, its use is
usually limited to shared memory systems. Large HPC systems
(like those in the TOP500 list) are often built by replicating
nodes that contain some memory and a number of sockets with
multicore processors or accelerators that can access the
memory on the node. Memory on remote nodes is not usually
visible in the address space of applications running on one
node. This confines OpenMP to the node domain and makes
OpenMP applications difficult to scale to a larger number of
nodes (and cores) without introducing other paradigms like
MPI.

There are runtimes that overcome this limitation, usually by
implementing Software Distributed Shared Memory, but they are
also transparent to the programmer and consequently do not
allow the fine tuning that may be needed to adapt to different
contexts. Moreover, they are not generally applicable to all
distributed memory platforms.

In contrast, MPI is a de facto standard commonly used for
large HPC applications. In MPI, the communication primitives
must be coded explicitly. Introducing communication primitives
to implement the cooperation patterns makes the code larger
and more difficult to read and understand. MPI is also more
complex to learn, since it offers a large number of functions,
including point-to-point and collective communication
primitives. This coding effort is justified when the
application has to run on thousands of cores. MPI allows
communication among cores on different nodes, and one could
think that it introduces performance overheads at the node
level compared with OpenMP. However, this is a controversial
issue with no clear answer, as shown in [14] [6].

We advocate a different approach that lets programmers use
OpenMP to express the parallelism in their application while
automatically generating an equivalent MPI program that can be
executed on a distributed memory (DM) machine. We have
developed a new tool, OMP2MPI, which transforms OpenMP source
code into MPI source code. The resulting code is valid by
construction and can be executed on different kinds of DM
systems, such as large HPC clusters, distributed memory
experimental processors like Intel Polaris or Ambric, or
experimental FPGA-based multi-soft-cores (like [10]). Another
potential use is to test whether there is any performance gain
from using MPI for an application on the same shared memory
platform.

The paper is organized as follows. Section 2 reviews the
related work. Section 3 describes the compiler transformations
used to translate from OpenMP to MPI. The following section
presents the performance obtained by several MPI codes
generated automatically from the Polybench benchmark [18].
Finally, Section 5 concludes with an explanation of the
obtained results and future tool improvements.

2. RELATED WORK 

Many source-to-source compiler alternatives have been proposed
to address the complexity of MPI programming. The standard
idea is to reuse codes implemented in OpenMP to generate
solutions that can be executed on distributed memory
architectures.

Most of the existing projects dedicated to running OpenMP
codes on distributed memory architectures rely on a software
layer to manage data placement on nodes (Software Distributed
Shared Memory architectures). An example is OMNI OpenMP [22]
and its optimizations proposed in [5] [23], which support
OpenMP in a distributed memory environment by using a software
distributed shared memory system (SDSM) as the underlying
runtime system for OpenMP. Cluster-enabled OMNI OpenMP on
SCASH is an implementation of the OMNI OpenMP compiler for the
software distributed shared memory system SCASH, running under
the SCore Cluster System Software. Another important software
system is Cluster OpenMP, proposed by Intel [12], which, like
the aforementioned, allows OpenMP programs to run on clusters,
although it was discontinued a few years ago. All these
solutions, based on a software layer, can be used on
distributed architectures without using the Message Passing
Interface, but they need some kind of runtime. In contrast,
OMP2MPI shows the programmer the generated solution that will
be executed on the cluster, and an expert can optimize it if
needed, which offers more flexibility in how the code is
executed on the cluster.

More similar ways to port OpenMP programs to clusters are
proposed in PaRADE [13], based on the OMNI compiler, and in
[9], based on Polaris. Both combine software-layer management
of data with the use of MPI primitives.

In IH El EH- the authors propose extending OpenMP with
additional clauses needed for streamization, as in our tool.
Nevertheless, the most similar tools are those proposed in
[4] [5] and [16]. Both are source-to-source compilers like our
tool, the first based on Cetus [7] and the second on PIPS [1],
and they generate solutions that can be compared to ours.

OMP2MPI is based on the Mercurium framework, since it supports
C/C++ source code and provides an intermediate representation
that is friendlier to work with than those of other existing
frameworks such as LLVM [15], PIPS, Cetus or ROSE [20]. It
also has a well-documented API that makes it easy to extend.

3. OMP2MPI COMPILER 

OMP2MPI is a source-to-source (S2S) compiler based on BSC's
Mercurium framework [17] that generates MPI code from OpenMP.
Mercurium [3] provides a source-to-source compilation
infrastructure aimed at fast prototyping and supports the C
and C++ languages. This platform is mainly used in the Nanos
environment to implement OpenMP, but since it is quite
extensible it has also been used to implement other
programming models and compiler transformations, as
demonstrated in [21]. It provides OMP2MPI with an abstract
representation of the input source code: the Abstract Syntax
Tree (AST). The AST gives easy access to the source code
structure, the symbol table and the context of each symbol.

The specialization of Mercurium for the OMP2MPI compiler is
achieved through a plugin architecture, where plugins
represent the phases of the compiler. These plugins are
written in C++ and loaded dynamically by the compiler
according to the selected configuration. Code transformations
are applied at the source code level, so there is no need to
know or modify the internal syntactic representation of the
compiler.

Figure 2 shows a simplified process flow of the OMP2MPI
compiler, in which an OpenMP input code is transformed into an
MPI code through analysis and manipulation of the AST.
OMP2MPI detects and transforms OpenMP blocks (focusing on
#pragma omp parallel for), dividing the work between MPI
master and slave processes that are distributed over the
available cores. To determine which OpenMP blocks have to be
transformed, OMP2MPI uses the directives proposed in p], as
illustrated in the input code example shown in Table 1.

The proposed tool combines point-to-point communication
functions (MPI_Send, MPI_Recv) and divides the code into
sequential and parallel parts by means of MPI ranks.

With these MPI functions, OMP2MPI is able to create a correct
MPI parallel program that, in the studied cases, is similar to
a hand-coded MPI version of the original problem. OMP2MPI
transforms the original code by adding the MPI initialization
and a workload distribution based on the rank of the calling
process in the communicator. The master process, with rank 0,
contains all the sequential code from the original OpenMP
application and manages access to the shared memory, being
responsible for keeping it updated on all the slaves, as shown
in Figure 1b. This contrasts with the original OpenMP memory
access represented in Figure 1a, where all the created threads
have access to the shared memory. In these figures, blue lines
represent a read operation and red lines represent a write
operation.

3.1 AST Manipulation 

The AST manipulation stage in Figure 2 is composed of four
main steps: 1) Context Analysis, 2) Loop Analysis,
3) Workload Distribution, 4) Finalization.

3.1.1 Context Analysis 

To transform the original code, OMP2MPI analyzes the context
in which the OpenMP block is originally computed and performs
a precise contextual analysis of the AST for each of the
variables needed inside it. In MPI, each process manages its
private variables independently, so the main problem in
transforming OpenMP to MPI lies in the shared
// init MPI and vars
const int FTAG = 0;
const int ATAG = 1;
int partSize = ((N - 0)) / (size - 1) / 10, offset;
if (my_id == 0) {
    int followIN = 0;
    int killed = 0;
    for (int to = 1; to < size; ++to) {
        MPI_Send(&followIN, 1, MPI_INT, to, ATAG, MPI_COMM_WORLD);
        MPI_Send(&partSize, 1, MPI_INT, to, ATAG, MPI_COMM_WORLD);
        MPI_Send(&sum[followIN], partSize, MPI_DOUBLE, to, ATAG, MPI_COMM_WORLD);
        followIN += partSize;
    }
    while (1) {
        MPI_Recv(&offset, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
        int source = stat.MPI_SOURCE;
        MPI_Recv(&partSize, 1, MPI_INT, source, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
        MPI_Recv(&sum[offset], partSize, MPI_DOUBLE, source, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
        if ((followIN + partSize) < N) {
            MPI_Send(&followIN, 1, MPI_INT, source, ATAG, MPI_COMM_WORLD);
            MPI_Send(&partSize, 1, MPI_INT, source, ATAG, MPI_COMM_WORLD);
            MPI_Send(&sum[followIN], partSize, MPI_DOUBLE, source, ATAG, MPI_COMM_WORLD);
        } else if ((N - followIN) < partSize && (N - followIN) > 0) {
            partSize = N - followIN;
            MPI_Send(&followIN, 1, MPI_INT, source, ATAG, MPI_COMM_WORLD);
            MPI_Send(&partSize, 1, MPI_INT, source, ATAG, MPI_COMM_WORLD);
            MPI_Send(&sum[followIN], partSize, MPI_DOUBLE, source, ATAG, MPI_COMM_WORLD);
        }
        if ((followIN + partSize) > N) {
            MPI_Send(&offset, 1, MPI_INT, source, FTAG, MPI_COMM_WORLD);
            killed++;
        }
        followIN += partSize;
        if (killed == size - 1) {
            break;
        }
    }
}
if (my_id != 0) {
    while (1) {
        MPI_Recv(&offset, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
        if (stat.MPI_TAG == ATAG) {
            MPI_Recv(&partSize, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
            MPI_Recv(&sum[offset], partSize, MPI_DOUBLE, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
            for (int i = offset; i < offset + partSize; ++i) {
                double x = (i + 0.5) * step;
                sum[i] = 4.0 / (1.0 + x * x);
            }
            MPI_Send(&offset, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            MPI_Send(&partSize, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            MPI_Send(&sum[offset], partSize, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
        else if (stat.MPI_TAG == FTAG) {
            break;  // listing is cut off here in the source; leaving the loop on FTAG is implied
        }
    }
}

Table 2: MPI source code resulting from the transformation of
the first OpenMP block shown in Table 1, which contains the
calculation of an array. Inserted MPI functions and created
variables are shown in green.


void main()
{
    #pragma omp parallel for target mpi
    for (int i = 0; i < N; ++i) {
        double x = (i + 0.5) * step;
        sum[i] = 4.0 / (1.0 + x * x);
    }
    #pragma omp parallel for reduction(+:total) target mpi
    for (int j = 0; j < N; ++j) {
        total += sum[j];
    }
}

Table 1: OpenMP blocks source code example using the
created target clause.


variables. For this reason, OMP2MPI studies each of the shared
variables used inside an OpenMP block and analyzes the AST to
identify when and whether they are accessed. OMP2MPI
classifies the variables used in an OpenMP block into IN
variables (read inside the block but not modified), OUT
variables (written inside the block, with the result needed
after the block finishes) and INOUT variables (which satisfy
both cases). Figure 3 represents the first OpenMP block
implemented in Table 1. The figure shows the difference
between a variable x that is read inside the OpenMP block
without any modification (IN variable) and a variable sum that
is written inside the OpenMP block and must be up to date
before it is next read (OUT variable). Depending on that
information, MPI_Send / MPI_Recv instructions are inserted to
transfer the data to the appropriate slaves.
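The IN / OUT / INOUT distinction above can be read as a
decision about transfer direction. The following is only a
minimal sketch under that reading, not OMP2MPI's actual
implementation: the names var_kind and classify, and the idea
of summarizing the AST analysis as three access counts, are
assumptions made for illustration.

```c
#include <assert.h>

/* Hypothetical names; the paper does not show OMP2MPI's internal types. */
typedef enum { VAR_NONE, VAR_IN, VAR_OUT, VAR_INOUT } var_kind;

/*
 * Classify a shared variable from three access counts that a context
 * analysis could collect (assumed summary of the AST walk):
 *   reads_inside  - reads inside the OpenMP block
 *   writes_inside - writes inside the OpenMP block
 *   reads_after   - reads after the block has finished
 */
var_kind classify(int reads_inside, int writes_inside, int reads_after)
{
    assert(reads_inside >= 0 && writes_inside >= 0 && reads_after >= 0);

    /* Value must travel master -> worker before the block runs. */
    int needs_send = reads_inside > 0;
    /* Result must travel worker -> master because it is used later. */
    int needs_recv = writes_inside > 0 && reads_after > 0;

    if (needs_send && needs_recv) return VAR_INOUT;
    if (needs_recv)               return VAR_OUT;
    if (needs_send)               return VAR_IN;
    return VAR_NONE;              /* no transfer needed */
}
```

Under these assumptions, in Table 1 step would be classified
as IN, sum in the first block as OUT, and the reduction
variable total as INOUT.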

The context analysis stage also studies where the OpenMP block
is located. For example, when the OpenMP block to be
transformed is inside a loop, OMP2MPI changes where the
initialization and task synchronization instructions are
inserted.

3.1.2 Loop Analysis 

This stage studies the loop annotated with the pragma omp for
directive in order to divide the computation of the statements
inside the for loop correctly. OMP2MPI
double work0;
int j = 0;
partSize = ((N - 0)) / (size - 1) / 10;
if (my_id == 0) {
    int followIN = 0;
    int killed = 0;
    for (int to = 1; to < size; ++to) {
        MPI_Send(&followIN, 1, MPI_INT, to, ATAG, MPI_COMM_WORLD);
        MPI_Send(&partSize, 1, MPI_INT, to, ATAG, MPI_COMM_WORLD);
        followIN += partSize;
    }
    while (1) {
        MPI_Recv(&offset, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
        int source = stat.MPI_SOURCE;
        MPI_Recv(&partSize, 1, MPI_INT, source, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
        MPI_Recv(&work0, 1, MPI_DOUBLE, source, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
        total += work0;
        if ((followIN + partSize) < N) {
            MPI_Send(&followIN, 1, MPI_INT, source, ATAG, MPI_COMM_WORLD);
            MPI_Send(&partSize, 1, MPI_INT, source, ATAG, MPI_COMM_WORLD);
        } else if ((N - followIN) < partSize && (N - followIN) > 0) {
            partSize = N - followIN;
            MPI_Send(&followIN, 1, MPI_INT, source, ATAG, MPI_COMM_WORLD);
            MPI_Send(&partSize, 1, MPI_INT, source, ATAG, MPI_COMM_WORLD);
        }
        if ((followIN + partSize) > N) {
            MPI_Send(&offset, 1, MPI_INT, source, FTAG, MPI_COMM_WORLD);
            killed++;
        }
        followIN += partSize;
        if (killed == size - 1) {
            break;
        }
    }
}
if (my_id != 0) {
    while (1) {
        MPI_Recv(&offset, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
        if (stat.MPI_TAG == ATAG) {
            MPI_Recv(&partSize, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
            total = 0;
            for (int j = offset; j < offset + partSize; ++j) {
                total += sum[j];
            }
            MPI_Send(&offset, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            MPI_Send(&partSize, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            MPI_Send(&total, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        } else if (stat.MPI_TAG == FTAG) {
            break;  // listing is cut off here in the source; leaving the loop on FTAG is implied
        }
    }
}
MPI_Finalize();
if (my_id == 0) {
    // non parallelized source code
}

Table 3: MPI source code resulting from the transformation of
the second OpenMP block shown in Table 1, which contains the
calculation of a reduced variable. Inserted MPI functions and
created variables are shown in green.


performs an exhaustive analysis of the for loop semantics to
determine: 1) the iterated variable, 2) its initial value,
3) its final value, 4) the increment or decrement applied
after each iteration, and 5) the logical comparison operation.
However, there are cases in which OMP2MPI is not able to
transform a loop because of its semantics, e.g. complex
non-linear increments of the iterator or multiple cases in the
condition. In these cases the studied blocks are not
transformed by OMP2MPI and are kept as OpenMP blocks.
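The five items extracted by the loop analysis are enough to
compute how many iterations have to be distributed. The
sketch below illustrates this for the linear case only. The
type loop_info and the functions is_supported and trip_count
are hypothetical names, and the rejection rule (zero step, or
a step whose sign contradicts the comparison) is an assumption
standing in for the unsupported cases mentioned above.

```c
#include <assert.h>

typedef enum { CMP_LT, CMP_LE, CMP_GT, CMP_GE } loop_cmp;

/* Hypothetical record of what a loop analysis could extract. */
typedef struct {
    long init;     /* initial value of the iterated variable */
    long bound;    /* final value used in the comparison */
    long step;     /* signed constant increment (+) or decrement (-) */
    loop_cmp cmp;  /* logical comparison operation */
} loop_info;

static int ascending(loop_cmp c) { return c == CMP_LT || c == CMP_LE; }

/* Assumed rule: only constant, non-zero steps that move towards the bound. */
int is_supported(loop_info l)
{
    return l.step != 0 && ascending(l.cmp) == (l.step > 0);
}

/* Number of iterations of a supported linear loop. */
long trip_count(loop_info l)
{
    assert(is_supported(l));
    long span = ascending(l.cmp) ? l.bound - l.init : l.init - l.bound;
    long stride = l.step > 0 ? l.step : -l.step;
    if (l.cmp == CMP_LE || l.cmp == CMP_GE)
        span += 1;                        /* inclusive bound */
    if (span <= 0)
        return 0;                         /* loop body never runs */
    return (span + stride - 1) / stride;  /* ceiling division */
}
```

For the loops of Table 1, which have the form
for (int i = 0; i < N; ++i), this gives N iterations.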

3.1.3 Workload Distribution 

With the context understood and the loop semantics determined,
OMP2MPI divides the computation of the OpenMP block to fit a
master/slaves MPI model using the producer/consumer paradigm.
OMP2MPI handles all the variables studied in the context
analysis stage. Figures 4 and 5 show how OMP2MPI divides the
computation of each OpenMP block. The iterations of the OpenMP
block are divided differently depending on whether the
original OpenMP block contains a schedule clause, such as
static or guided, in its pragma directives. OMP2MPI divides
the iterations of the for loop among all the available slaves.
Figure 5 shows the division model for an OpenMP dynamic block,
while Figure 4 shows how an OpenMP static or guided block is
divided between master and slaves.
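The effect of the two division models on load balance can be
illustrated with a small simulation. This is a sketch under
strong assumptions, not a model of the generated code:
communication cost is ignored, the static model hands out
chunks in rank order with all workers finishing a round before
the next one starts (as the caption of Figure 4 describes),
and the dynamic model gives the next chunk to whichever worker
becomes free first. The function names and MAX_WORKERS are
hypothetical.

```c
#include <assert.h>

#define MAX_WORKERS 64

/* Static model: rounds of `workers` chunks; each round lasts as long
 * as its slowest chunk. cost[k] is the run time of chunk k. */
double static_makespan(const double *cost, int chunks, int workers)
{
    assert(workers > 0 && chunks >= 0);
    double total = 0.0;
    for (int first = 0; first < chunks; first += workers) {
        double slowest = 0.0;
        for (int k = first; k < first + workers && k < chunks; ++k)
            if (cost[k] > slowest)
                slowest = cost[k];
        total += slowest;
    }
    return total;
}

/* Dynamic model: each chunk goes to the worker that is free earliest. */
double dynamic_makespan(const double *cost, int chunks, int workers)
{
    assert(workers > 0 && workers <= MAX_WORKERS && chunks >= 0);
    double busy[MAX_WORKERS] = {0};
    for (int k = 0; k < chunks; ++k) {
        int next = 0;
        for (int w = 1; w < workers; ++w)
            if (busy[w] < busy[next])
                next = w;
        busy[next] += cost[k];
    }
    double finish = 0.0;
    for (int w = 0; w < workers; ++w)
        if (busy[w] > finish)
            finish = busy[w];
    return finish;
}
```

With chunk costs {4, 1, 1, 1} and two workers, the static
model needs 4 + 1 = 5 time units, while the dynamic model
finishes in 4, because the second worker processes the three
short chunks while the first is busy with the long one.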

With the static division, the outer loop is scheduled in a
round-robin fashion by using MPI_Recv from specific ranks.
This can lead to an unbalanced load. However, this kind of
division is necessary because OMP2MPI is meant to be faithful
to the original OpenMP code, which could contain this
directive. With the dynamic division, on the other hand, the
outer loop is scheduled dynamically by using MPI_Recv with
MPI_ANY_SOURCE, which is more efficient. To reduce load
imbalance, OMP2MPI determines the range of iterations that
each process computes at a time by dividing the total number
of iterations by the number of available slaves and then
dividing the result by 10, as shown in line 4 of Table 2.
Table 2 illustrates an example of an array computation:
OMP2MPI transforms the first OpenMP block of Table 1 into the
listed source
(a) Example of a memory access pattern of an OpenMP
application. Threads directly access the shared memory.

(b) Example of the proposed memory access pattern for shared
variables in MPI target applications. Accesses to shared
variables are centralized in the master node, and worker
processes have to communicate with it to access them.

Figure 1: Shared memory access on different architectures
(blue lines represent a read operation; red lines represent a
write operation).

Figure 2: In blue, the functionalities already offered by the
Mercurium framework. In green, the AST manipulation process
done by OMP2MPI.



Figure 3: Context example. Variable x is written before the
parallel loop and is not accessed after it, so it can be
labeled as an IN variable, while sum is written before the
loop but not read inside it, and is read after the parallel
loop, so it can be labeled as OUT.


code. Two rules are defined to ensure that the calculation
over a variable can be divided into independent executions of
the original for loop. In the first case, the divided iterator
appears linearly in the first dimension of the access in a
write operation on a variable (e.g. var[i][j]=2*i); MPI_Send
and MPI_Recv then transfer to the master just the portion of
the OUT variable that has to be read or has been modified,
from offset to the current maximum iterator value. In the
second case, the divided iterator is not the first dimension
of the access in a write operation but is used in one of the
variables of the assignment (e.g. var[i]=2*j); OMP2MPI then
transfers the full array, but only when the current iteration
belongs to the last slave in execution.
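The chunk-size rule from line 4 of Table 2 (iterations
divided by the number of slaves, then by 10) can be isolated
in a few lines. The sketch below is illustrative only: the
function names are hypothetical, and the guard that keeps the
chunk size at least 1 for very short loops is an added
assumption that does not appear in the generated code of
Table 2.

```c
#include <assert.h>

/* Chunk size as in Table 2: n iterations, `size` MPI processes,
 * of which rank 0 is the master and size - 1 are workers. */
int part_size(int n, int size)
{
    assert(n >= 0 && size > 1);
    int p = n / (size - 1) / 10;
    return p > 0 ? p : 1;  /* assumption: avoid a zero-length chunk */
}

/* Number of chunks the master hands out; the last one may be shorter. */
int chunk_count(int n, int size)
{
    int p = part_size(n, size);
    return (n + p - 1) / p;
}

/* Length of the chunk that starts at `offset` when chunks have size p. */
int chunk_len(int n, int offset, int p)
{
    assert(p > 0 && offset >= 0 && offset <= n);
    int rest = n - offset;
    return rest < p ? rest : p;
}
```

For example, 1000 iterations on 11 processes (10 workers) give
chunks of 10 iterations, so each worker receives about ten
chunks over the run rather than a single block of 100, which
is what lets faster workers take on more work.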

The workload distribution used is not applicable to all the
cases accepted in an OpenMP block. OMP2MPI is not able to
divide the computation when there are concurrent accesses to a
shared variable, when the iterator is in the second dimension
of an access to it, or when the variable is not accessed
linearly.

A special case of INOUT variable is a variable specified as a
reduction variable by the OpenMP reduction clause. In this
case, OMP2MPI determines the starting value of the reduction
variable depending on the reduction operation (a starting
value of 0 for "+" and "-" operations, or 1 for "*" and "/")
and accumulates in the resulting variable the partial results
computed on the slaves by applying the reduction operation.
Table 3 shows how this is performed by OMP2MPI when
transforming the second block of Table 1 into the listed
source code.
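The reduction handling can be sketched without MPI by
replacing each message exchange with a loop iteration. This is
an illustration only: red_op, red_identity and chunked_sum are
hypothetical names, the starting values are the ones stated
above, and the combining step is shown only for "+", mirroring
total += work0 in Table 3.

```c
#include <assert.h>

typedef enum { RED_ADD, RED_SUB, RED_MUL, RED_DIV } red_op;

/* Starting value of the reduction variable, as stated in the text. */
double red_identity(red_op op)
{
    return (op == RED_ADD || op == RED_SUB) ? 0.0 : 1.0;
}

/* Sum v[0..n-1] in chunks of p elements. Each chunk plays the role of
 * one worker task; the outer loop plays the role of the master. */
double chunked_sum(const double *v, int n, int p)
{
    assert(n >= 0 && p > 0);
    double total = red_identity(RED_ADD);        /* master accumulator */
    for (int offset = 0; offset < n; offset += p) {
        double partial = red_identity(RED_ADD);  /* worker start value */
        int end = offset + p < n ? offset + p : n;
        for (int j = offset; j < end; ++j)
            partial += v[j];
        total += partial;                        /* master: total += work0 */
    }
    return total;
}
```

Because each partial result starts from the identity of the
operation, the combined value is independent of how the
iterations are split into chunks, up to floating-point
rounding.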

3.1.4 Finalization 

The final stage of the AST manipulation step is the
finalization stage, which assigns the remaining source code
that is not parallelized with MPI to the master node, to avoid
unnecessary computation, and places the MPI_Finalize
instruction before it. The result is illustrated in the last
lines of Table 3.


Figure 4: Static workload distribution. The work is sent in
order according to the rank of the slaves. All slaves have to
finish before the next piece of workload is distributed.


Figure 5: Dynamic workload distribution. The work is divided
dynamically: the master answers each slave that responds with
the following range of iterations and the variables needed to
do the computation.

4. RESULTS


We compiled a subset of the Polybench benchmark with OMP2MPI.
The generated versions were executed on 64 E7-4800 CPUs at
2.40 GHz (Bullion quadri module) and compiled with bullxmpi,
which is compatible with MPI 2.1 and enhanced by Bull with
features such as effective abnormal pattern detection,
network-aware collective operations and multi-path network
fail-over, to increase the reliability, resilience and
performance of parallel MPI applications. We compare the codes
generated by OMP2MPI with the original OpenMP ones and also
with a sequential version of the same problem. Figure 6 shows
the speed-up comparison for the selected problems. The figure
shows that OMP2MPI produces a good transformation of the
original OpenMP code: in most cases the generated code scales
better than the original, with a speed-up that grows linearly
with the number of processors used.

5. CONCLUSIONS 

We have presented OMP2MPI, a tool that facilitates porting
OpenMP source code to MPI. We have shown how it automatically
translates OpenMP code so that it can run beyond a single
node, allowing the program to exploit non-shared-memory
architectures such as clusters or NoC-based MPSoCs.

This automatic translation is very useful because the
programmer can keep working with the easily readable OpenMP
model and simply compile with the OMP2MPI compiler to obtain
the advantages that the MPI model offers (speed-up,
scalability, etc.). The readability of the generated code is
acceptable, which allows further optimization by an expert who
wants to improve the performance results. The experiments made
with the Polyhedral Benchmark are promising for this
effortless version, which scales better than the original
Figure 6: Speed-up of the tested problems using 16, 32 and 64
processors, compared with the sequential version.


OpenMP code. In most cases, the speed-up for 64 cores is
higher than 20x compared to the sequential version and higher
than 4x compared to the original OpenMP code. These results
show again, as mentioned in the introduction, that OpenMP does
not always perform better than MPI on shared memory systems.

Future improvements to OMP2MPI will cover all the possible
uses of shared variables inside an OpenMP block and allow the
target mpi clause on more OpenMP directives, for example
critical sections, which could be transformed by using
MPI_Allreduce, or atomic sections, which could be transformed
to MPI_Bcast.